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Abstract 

We consider supervised learning problems where the features are embedded in a graph, such as 
gene expressions in a gene network. In this context, it is of much interest to automatically select 
a subgraph with few connected components; by exploiting prior knowledge, one can indeed im- 
prove the prediction performance or obtain results that are easier to interpret. Regularization or 
penalty functions for selecting features in graphs have recently been proposed, but they raise new 
algorithmic challenges. For example, they typically require solving a combinatorially hard selec- 
tion problem among all connected subgraphs. In this paper, we propose computationally feasible 
strategies to select a sparse and well connected subset of features sitting on a directed acyclic graph 
(DAG). We introduce structured sparsity penalties over paths on a DAG called "path coding" penal- 
ties. Unlike existing regularization functions that model long range interactions between features 
in a graph, path coding penalties are tractable. The penalties and their proximal operators involve 
path selection problems, which we efficiently solve by leveraging network flow optimization. We 
experimentally show on synthetic, image, and genomic data that our approach is scalable and leads 
to more connected subgraphs than other regularization functions for graphs. 
Keywords: Convex and non-convex optimization, network flow optimization, graph sparsity. 



1. Introduction 

Supervised sparse estimation problems have been the topic of much research in statistical machine 
learning and signal processing. In high dimensional settings, restoring a signal or learning a model 
is often difficult without a priori knowledge. When the solution is known beforehand to be sparse — 
that is, has only a few non-zero coefficients, regularizing with sparsity-inducing penalties has been 
shown to provide better prediction and solutions that are easier to interpret. For that purpose, non- 
convex penalties and greedy algorithms have been proposed (Akaike, 1973; Schwarz, 1978; Rissa- 
nen, 1978; Mallat and Zhang, 1993; Fan and Li, 2001). More recently, convex relaxations such as 
the £i-norm (Tibshirani, 1996; Chen et al., 1999) and efficient algorithms have been developed (Os- 
borne et al., 2000; Nesterov, 2007; Beck and Teboulle, 2009; Wright et al., 2009). 

In this paper, we consider supervised learning problems where more information is available 
than just sparsity of the solution. More precisely, we assume that the features (or predictors) can be 
identified to the vertices of a graph, such as gene expressions in a gene network. In this context, it 
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can be desirable to take into account the graph structure in the regularization (Rapaport et al., 2007). 
In particular, we are interested in automatically identifying a subgraph with few connected compo- 
nents (Jacob et al., 2009; Huang et al., 2011), groups of genes involved in a disease for example. 
There are two equally important reasons for promoting the connectivity of the problem solution: 
either connectivity is a prior information, which might improve the prediction performance, or con- 
nected components may be easier to interpret than isolated variables are. 

Formally, let us consider a supervised sparse estimation problem involving p features, and let 
us assume that we are given an undirected or directed graph G = (V,E), where V is a vertex set 
identified to {1, . . . ,p}, and E C V x V is an arc (edge) set. Classical empirical risk minimization 
problems can be formulated as 

min[L(w) + ^(w)], (1) 

wgKp 

where w is a weight vector in M p , which we wish to estimate; L : W — > R is a convex loss function, 
and Q. : R p — > E is a regularization function. In order to obtain a sparse solution, Q. is often chosen 
to be the Iq- (cardinality of the support) or l\ -penalty. In this paper, we are also interested in 
encouraging the sparsity pattern of w (the set of non-zero coefficients) to form a subgraph of G with 
few connected components. 

To the best of our knowledge, penalties promoting the connectivity of sparsity patterns in a 
graph can be classified into two categories. The ones of the first category involve pairwise interac- 
tions terms between vertices linked by an arc (Cehver et al., 2008; Jacob et al., 2009; Chen et al., 
2011); each term encourages two neighbors in the graph to be simultaneously selected. Such reg- 
ularization functions usually lead to tractable optimization problems, but they do not model long 
range interactions between variables in the graph, and they do not promote large connected com- 
ponents. Penalties from the second category are more complex, and directly involve hard combi- 
natorial problems (Huang et al., 2011). As such, they cannot be used without approximations. The 
problem of finding tractable penalties that model long range interactions is therefore acute. The 
main contribution of our paper is a solution to this problem when the graph is directed and acyclic. 

Of much interest to us are the non-convex penalty of Huang et al. (201 1) and the convex penalty 
of Jacob et al. (2009). Given a pre-defined set of possibly overlapping groups of variables Q, these 
two structured sparsity-inducing regularization functions encourage a sparsity pattern to be in the 
union of a small number of groups from Q. Both penalties induce a similar regularization effect 
and are strongly related to each other. In fact, we show in Section 3 that the penalty of Jacob et al. 
(2009) can be interpreted as a convex relaxation of the non-convex penalty of Huang et al. (201 1). 
These two penalties go beyond classical unstructured sparsity, but they are also complex and raise 
new challenging combinatorial problems. For example, Huang et al. (2011) define Q as the set of 
all connected subgraphs of G, which leads to well connected solutions but also leads to intractable 
optimization problems; the latter are approximately addressed by Huang et al. (2011) with greedy 
algorithms. Jacob et al. (2009) choose a different strategy and define Q as the pairs of vertices linked 
by an arc, which, as a result, encourages neighbors in the graph to be simultaneously selected. This 
last formulation is computationally tractable, but does not model long range interactions between 
features. Another suggestion from Jacob et al. (2009) and Huang et al. (2011) consists of defin- 
ing Q as the set of connected subgraphs up to a size k. The number of such subgraphs is however 
exponential in k, making this approach difficult to use even for small subgraph sizes (k = 3,4) as 
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(a) Sparsity pattern in an undirected graph. (b) Selected paths in a DAG. 



Figure 1: Left (a): an undirected graph with 12 nodes. A sparsity pattern forming a subgraph with 
two connected components is represented by gray nodes. Right (b): when the graph is a DAG, the 
sparsity pattern is covered by two paths (2,3,6) and (9, 11, 12) represented by bold arrows. 

soon as the graph is large {pm 10000) and connected enough. 1 These observations naturally raise 
the question: can we replace connected subgraphs by another structure that is rich enough to model 
long-range interactions in the graph and leads to computationally feasible penalties? 

When the graph G is directed and acyclic, we propose a solution built upon two ideas. First, 
we use in the penalty framework of Jacob et al. (2009) and Huang et al. (2011) a novel group 
structure Q p that contains all the paths in G; a path is defined as a sequence of vertices (vi, . .. ,V£) 
such that for all 1 < i <k, we have (v,-,v !+ i) £E. The second idea is to use appropriate costs for 
each path (the "price" one has to pay to select a path), which, as we show in the sequel, allows us 
to leverage network flow optimization. We call the resulting regularization functions "path coding" 
penalties. They go beyond pairwise interactions between vertices and model long-range interactions 
between the variables in the graph. They encourage sparsity patterns forming subgraphs that can be 
covered by a small number of paths, therefore promoting connectivity of the solution. We illustrate 
the "path coding" concept for DAGs in Figure 1. Even though the number of paths in a DAG is 
exponential in the graph size, we map the path selection problems our penalties involve to network 
flow formulations (see Ahuja et al., 1993; Bertsekas, 1998), which can be solved in polynomial 
time. As shown in Section 3, we build minimum cost flow formulations such that sending a positive 
amount of flow along a path for minimizing a cost is equivalent to selecting the path. This allows us 
to efficiently compute the penalties and their proximal operators, a key tool to address regularized 
problems (see Bach et al., 2012, for a review). 

Therefore, we make in this paper a new link between structured graph penalties in DAGs and 
network flow optimization. The development of network flow optimization techniques has been 
very active from the 60's to the 90's (see Ford and Fulkerson, 1956; Goldberg and Tarjan, 1986; 
Ahuja et al., 1993; Goldberg, 1997; Bertsekas, 1998). They have attracted a lot of attention during 
the last decade in the computer vision community for their ability to solve large-scale combinato- 
rial problems typically arising in image segmentation tasks (Boykov et al., 2001). Concretely, by 
mapping a problem at hand to a network flow formulation, one can possibly obtain fast algorithms 

1. This issue was confirmed to us in a private communication with Laurent Jacob, and this was one of our main motiva- 
tion for developing new algorithmic tools overcoming this problem. 



3 



Mairal and Yu 



to solve the original problem. Of course, such a mapping does not always exist or can be difficult 
to find. This is made possible in the context of path coding penalties thanks to decomposability 
properties of the path costs, which we make explicit in Section 3. 

We remark that different network flow formulations have also been used recently for sparse es- 
timation (Cehver et al., 2008; Chambolle and Darbon, 2009; Hoefling, 2010; Mairal et al., 2011). 
Cehver et al. (2008) combine for example sparsity and Markov random fields for signal reconstruc- 
tion tasks. They introduce a non-convex penalty consisting of pairwise interaction terms between 
vertices of a graph, and their approach requires iteratively solving maximum flow problems. It has 
also been shown by Chambolle and Darbon (2009) and Hoefling (2010) that for the anisotropic 
total-variation penalty, called "fused lasso" in statistics, the solution to problem (1) can be obtained 
by solving a sequence of parametric maximum flow problems. The total-variation penalty can be 
useful to obtain piecewise constant solutions on a graph (see Chen et al., 2011). Finally, Mairal 
et al. (2011) have shown that the structured sparsity-inducing regularization function of Jenatton 
et al. (2011) is related to network flows in a similar way as the total variation penalty is. Note that 
both Jacob et al. (2009) and Jenatton et al. (2011) use the same terminology of "group Lasso with 
overlapping groups", leading to some confusion in the literature. Yet, their works are significantly 
different and are in fact complementary: given a group structure the penalty of Jacob et al. (2009) 
encourages solutions whose sparsity pattern is a union of a few groups, whereas the penalty of Je- 
natton et al. (201 1) promotes an intersection of groups. It is natural to use the framework of Jacob 
et al. (2009) to encourage connectivity of a problem solution in a graph, e.g., by choosing Q as the 
pairs of vertices linked by arc. It is however not obvious how to obtain this effect with the penalty 
of Jenatton et al. (2011). We discuss this question in more details in Appendix A. 

To summarize, we have designed non-convex and convex penalty functions to do feature selec- 
tion in directed acyclic graphs. Because our penalties involve an exponential number of variables, 
one for every path in the graph, existing optimization techniques cannot be used. To deal with this 
issue, we introduce network flow optimization tools that implicitly handle the exponential number 
of paths, allowing the penalties and their proximal operators to be computed in polynomial time. As 
a result, our penalties can model long-range interactions in the graph and are tractable. 

The paper is organized as follows: Section 2 presents preliminary tools, notably a brief intro- 
duction to network flows. Section 3 proposes the path coding penalties and devises optimization 
techniques for solving the corresponding sparse estimation problems. Section 4 is devoted to exper- 
iments on synthetic, genomic, and image data to demonstrate the benefits of path coding penalties 
over existing ones and the scalability of our approach. Section 5 concludes the paper. 

2. Preliminaries 

As we show later, our path coding penalties are linked to the concept of flow in a graph. Since this 
concept is not widely used in the machine learning literature, we provide a brief overview of this 
topic in Section 2.1. In Section 2.2, we also present proximal gradient methods, which have become 
very popular for solving sparse regularized problems (see Bach et al., 2012). 

2.1 Network Flow Optimization 

Network flows have been well studied in the computer science community, and have led to efficient 
dedicated algorithms for solving particular linear programs (see Ahuja et al., 1993; Bertsekas, 1998). 
Let us consider a directed graph G = (V,E) with two special nodes s and t, respectively dubbed 
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(a) A flow in a DAG. 



(b) A flow in a directed graph with a cycle. 




(c) (i-.f)-path flow in a DAG. 



(d) A cycle flow in a directed graph. 



Figure 2: Examples of flows in a graph, (a) The flow on the DAG can be interpreted as two units of 
flow sent from s to t along the paths (s, 1,3,4,?) and (s,2,3,4,t). (b) The flow can be interpreted as 
two units of flow sent from s to t on the same paths as in (a) plus a unit of flow circulating along the 
cycle (1,3,2, 1). (c) (j,f)-pafh flow along the path 0,2,3,4,/). (d) Cycle flow along (1,3,2,1). 



source and sink. A flow f on the graph G is defined as a non-negative function on arcs \f U \]( u ,v)eE 
that satisfies two sets of linear constraints: 

• capacity constraints: the value of the flow f uv on an arc (u,v) in E should satisfy the con- 
straint l uv < f uv < 8 MV , where l uv and § MV are respectively called lower and upper capacities; 

• conservation constraints: the sum of incoming flow at a vertex is equal to the sum of outgo- 
ing flow except for the source s and the sink t. 

We present examples of flows in Figures 2a and 2b, and denote the set of flows on a graph G by jF. 
We remark that with appropriate graph transformations, the flow definition we have given admits 
several variants. It is indeed possible to consider several source and sink nodes, capacity constraints 
on the amount of flow going through vertices, or several arcs with different capacities between two 
vertices (see more details in Ahuja et al., 1993). 

Some network flow problems have attracted a lot of attention because of their wide range of 
applications, for example in engineering, physics, transportation, or telecommunications (see Ahuja 
et al., 1993). In particular, the maximum flow problem consists of computing how much flow can be 
sent from the source to the sink through the network (Ford and Fulkerson, 1956). In other words, it 
consists of finding a flow fro. 'J maximizing EweV:(i,w)e£ fsu- Another more general problem, which 
is of interest to us, is the minimum cost flow problem. It consists of finding a flow finff minimizing 
a linear cost Y.(u,v)eE c uvfm>, where every arc (u, v) in E has a cost c uv in E. Both the maximum flow 
and minimum cost flow problems are linear programs, and can therefore be solved using generic 
linear programming tools, e.g., interior points methods (see Boyd and Vandenberghe, 2004; Nocedal 
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and Wright, 2006). Dedicated algorithms exploiting the network structure of flows have however 
proven to be much more efficient. It has indeed been shown that minimum cost flow problems 
can be solved in strongly polynomial time — that is, an exact solution can be obtained in a finite 
number of steps that is polynomial in |V| and \E\ (see Ahuja et al., 1993). More important, these 
dedicated algorithms are empirically efficient and can often handle large-scale problems (Goldberg 
and Tarjan, 1986; Goldberg, 1997; Boykov et al., 2001). 

Among linear programs, network flow problems have a few distinctive features. The most strik- 
ing one is the "physical" interpretation of a flow as a sum of quantities circulating in the network. 
The flow decomposition theorem (see Ahuja et al., 1993, Theorem 3.5) makes this interpretation 
more precise by saying that every flow vector can always be decomposed into a sum of (s,f)-path 
flows (units of flow sent from s to t along a path) and cycle flows (units of flow circulating along 
a cycle in the graph). We give examples of (s,?)-path and cycle flow in Figures 2c and 2d, and 
present examples of flows in Figures 2a and 2b along with their decompositions in (,y,f)-pafh and 
cycle flows. Built upon the interpretation of flows as quantities circulating in the network, efficient 
algorithms have been developed, e.g., the classical augmenting path algorithm of Ford and Fulker- 
son (1956) for solving maximum flow problems. Another feature of flow problems is the locality 
of the constraints; each one only involves neighbors of a vertex in the graph. This locality is also 
exploited to design algorithms (Goldberg and Tarjan, 1986; Goldberg, 1997). Finally, minimum 
cost flow problems have a remarkable integrality property: a minimum cost flow problem where all 
capacity constraints are integers can be shown to have an integral solution (see Ahuja et al., 1993). 

Later in our paper, we will map path selection problems to network flows by exploiting the 
flow decomposition theorem. In a nutshell, this apparently simple theorem has an interesting con- 
sequence: minimum cost flow problems can be seen from two equivalent viewpoints. Either one is 
looking for the value f uv of a flow on every arc (u,v) of a graph minimizing the cost Y,(u.v)eE c uvfuv> 
or one is looking for the quantity of flow that should circulate on every (^fj-path and cycle flow 
for minimizing the same cost. Of course, when the graph G is a DAG, cycle flows do not exist. We 
will define flow problems such that selecting a path in the context of our path coding penalties is 
equivalent to sending some flow along a corresponding (s,?)-pafh. We will also exploit the integral- 
ity property to develop tools both adapted to non-convex penalties and convex ones, respectively 
involving discrete and continuous optimization problems. With these tools in hand, we will be able 
to deal efficiently with a simple class of optimization problems involving our path coding penalties. 
To deal with the more complex problem ( 1 ), we will need additional tools, which we now present. 

2.2 Proximal Gradient Methods 

Proximal gradient methods are iterative schemes for minimizing objective functions of the same 
form as (1), when the function L is convex and differentiable with a Lipschitz continuous gradient. 
The simplest proximal gradient method consists of linearizing at each iteration the function L around 
a current estimate w, and this estimate is updated as the (unique by strong convexity) solution to 



which is assumed to be easier to solve than the original problem ( 1 ). The quadratic term keeps the 
update in a neighborhood where L is close to its linear approximation, and the parameter p is an 
upper bound on the Lipschitz constant of VL. When Q. is convex, this scheme is known to converge 




quadratic term 



(2) 
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to the solution to problem (1) and admits variants with optimal convergence rates among first-order 
methods (Nesterov, 2007; Beck and Teboulle, 2009). When Q. is non-convex, the guarantees are 
weak (finding the global optimum is out of reach), but it is easy to show that these updates can be 
seen as a majorization-minimization algorithm (see Hunter and Lange, 2004) iteratively decreasing 
the value of the objective function (Wright et al., 2009). When £2 is the l\- or £o-P ena lty, the corre- 
sponding optimization schemes (2) are respectively known as iterative soft- and hard-thresholding 
algorithms (Daubechies et al., 2004; Blumensath and Davies, 2009). Note that when L is not differ- 
entiable, similar schemes exist, known as mirror-descent (Nemirovsky and Yudin, 1983). 
Another insight about these methods can be obtained by rewriting sub-problem (2) as 



mm 



w — -VL(w) — w 
P 



2 X 

+ -Q(w) 

2 P 



When X = 0, the solution is obtained by a classical gradient step w «— w — (l/p)VL(w). Thus, 
proximal gradient methods can be interpreted as a generalization of gradient descent algorithms 
when dealing with a nonsmooth term. They are, however, only interesting when problem (2) can be 
efficiently solved. Formally, we wish to be able to compute the proximal operator defined as: 

Definition 1 (Proximal Operator.) 

The proximal operator associated with a regularization term XGl, which we denote by Prox^Q, is the 
function that maps a vector u E M p to the unique (by strong convexity) solution to 



mm 

weRp 



1 



u — w||2 + Ai2(w) 



(3) 



Computing efficiently this operator has been shown to be possible for many penalties Q. (see Bach 
et al., 2012). We will show in the sequel that it is also possible for our path coding penalties. 



3. Sparse Estimation in Graphs with Path Coding Penalties 

We now present our path coding penalties, which exploit the structured sparsity frameworks of Jacob 
et al. (2009) and Huang et al. (201 1). Because we choose a group structure Q p with an exponential 
number of groups, one for every path in the graph, the optimization techniques presented by Jacob 
et al. (2009) or Huang et al. (2011) cannot be used anymore. We will deal with this issue by 
introducing flow definitions of the path coding penalties. 

3.1 Path Coding Penalties 

The so called "block coding" penalty of Huang et al. (201 1) can be written for a vector w in W and 
any set Q of groups of variables as 

cp^(w) = min{ £r| g s.t. Supp(w) C \Jg\, (4) 

where the r^'s are non-negative weights, and J is a subset of groups in Q whose union covers the 
support of w formally defined as Supp(w) = {j € {1, . . . ,p} : w ; - ^ 0}. When the weights are well 
chosen, the non-convex penalty cp^ encourages solutions w whose support is in the union of a small 
number of groups; in other words, the cardinality of J should be small. We remark that Huang et al. 
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(2011) originally introduce this regularization function under a more general information-theoretic 
point of view where cp^ is a code length (see Barron et al., 1998; Cover and Thomas, 2006), and the 
weights r\ g represent the number of bits encoding the fact that a group g is selected. 2 One motivation 
for using cp^ is that the selection of a few groups might be easier to inteipret than the selection of 
isolated variables. This formulation extends non-convex group sparsity regularization by allowing 
any group structure Q to be considered. Nevertheless, a major drawback is that computing this non- 
convex penalty cp^(w) for a general group structure Q is difficult. Equation (4) is indeed an instance 
of a set cover problem, which is NP-hard (see Cormen et al., 2001), and appropriate approximations, 
e.g., greedy algorithms, have to be used in practice. 

As often when dealing with non-convex penalties, one can either try to solve directly the cor- 
responding non-convex problems or look for a convex relaxation. As we empirically show in Sec- 
tion 4, having both non-convex and convex variants of a penalty can be a significant asset. One 
variant can indeed outperform the other one in some situations, while being the other way around in 
some other cases. It is therefore interesting to look for a convex relaxation of cp^. We denote by r\ 

the vector [T|g]ge£ in M.f , and by N the binary matrix in {0, 1}p x I£I whose columns are indexed by 
the groups g in Q, such that the entry Njg is equal to one when the index j is in the group g, and 
zero otherwise. Equation (4) can be rewritten as a Boolean linear program, a form which will be 
more convenient in the rest of the paper: 

<Pg(w) = min \r\ T x s.t. Nx > Supp(w) \ , (5) 

xe{0,i}l5l > 

where, with an abuse of notation, Supp(w) is here a vector in {0, 1} P such that its 7-th entry is 1 
if j is in the support of w and otherwise. Let us also denote by |w[ the vector in R'| obtained by 
replacing the entries of w by their absolute value. We can now consider a convex relaxation of (p^: 

V^(w) = m i n \ r \ Tx s -t- Nx>|w|>, (6) 

xeftf L J 

where not only the optimization problem above is a linear program, but in addition \|/^ is a convex 
function — in fact it can be shown to be a norm. Such a relaxation is classical and corresponds to the 
same mechanism relating the £q- to the l\ -penalty, replacing Supp(w) by jw|. The next lemma tells 
us that we have in fact obtained a variant of the penalty introduced by Jacob et al. (2009). 

Lemma 1 (Relation Between \\fg and the Penalty of Jacob et al. (2009).) 

Suppose that any pattern in {0, l} 7 ' can be represented by a union of groups in Q. Then, the 
function \\fg defined in (6) is equal to the penalty of Jacob et al. (2009) with l^-norms. 

Note that Jacob et al. (2009) have introduced their penalty from a different perspective, and the 
link between (6) and their work is not obvious at first sight. In addition, their penalty involves a 
sum of ^2-norms, which needs to be replaced by ^-norms for the lemma to hold. Hence, \|/^ is a 
"variant" of the penalty of Jacob et al. (2009). We give more details and the proof of this lemma in 
Appendix B. 

2. Note that Huang et al. (201 1) do not directly use the function (fig as a regularization function. The "coding complex- 
ity" they introduce for a vector w counts the number of bits to code the support of w, which is achieved by (pg, but 
also use an i'o-penalty to count the number of bits encoding the values of the non-zero coefficients in w. 

3. At the same time as us, Obozinski and Bach (2012) have studied a larger class of non-convex combinatorial penalties 
and their corresponding convex relaxations, obtaining in particular a more general result than Lemma 1, showing 
that Vfg is the tightest convex relaxation of (pg. 
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Now that q>g and \\fg have been introduced, we are interested in automatically selecting a small 
number of connected subgraphs from a directed acyclic graph G = (V,E). In Section 1, we already 
discussed group structures Q and introduced Q p the set of paths in G. As a result, the path coding 
penalties cp^ and \\fg encourage solutions that are sparse while forming a subgraph that can be 
covered by a small number of paths. As we show in this section, this choice leads to tractable 
formulations when the weights r\ g for every path g in Q p are appropriately chosen. 

We will show in the sequel that a natural choice is to define for all g in Q p 

Tl*=Y+|s|, (7) 

where y is a new parameter encouraging the connectivity of the solution whereas \g\ encourages 
sparsity. It is indeed possible to show that when y = 0, the functions cp^ and \\Tg respectively 
become the Iq- and the i\ -penalties, therefore encouraging sparsity but not connectivity. On the 
other hand, when y is large and the term \g\ is negligible, q>g (yv) simply "counts" how many paths 
are required to cover the support of w, thereby encouraging connectivity regardless of the sparsity 
of w. 

In fact, the choice (7) is a particular case of a more general class of weights r| g , which our algo- 
rithmic framework can handle. Let us enrich the original directed acyclic graph G by introducing a 
source node s and a sink node t. Formally, we define a new graph G' = (V',E') with 

v'^u{^}, 

E' = EU{(s,v) ;veV}U{(u,t) : u e V}. 

In plain words, the graph G' , which is a DAG, contains the graph G and two nodes s, t that are linked 
to every vertices of G. Let us also assume that some costs c uv in R are defined for all arcs (w,v) 
in E'. Then, for a path g = (u\ , U2, ■ ■ ■ , «/t) in Q p , we define the weight r\ g as 

% = c m + ( c UiUi+i J + c u k t = 52 C " v ' ^ 
!=1 ' (u,v)e(s,g,t) 

where the notation (s,g,t) stands for the path (s,u\ ,112, ■ ■ ■ ,Uk,t) in G'. The decomposition of the 
weights r\ g as a sum of costs on (s, ?)-paths of G' (the paths (s, g, t) with g in Q p ) is a key component 
of the algorithmic framework we present next. The construction of the graph G' is illustrated in 
Figures 3a and 3b for two cost configurations. We remark that the simple choice of weights (7) cor- 
responds to the choice (8) with the costs c su = yfor all u in V and c uv = 1 otherwise (see Figure 3a). 
Designing costs c m that go beyond the simple choice (7) can be useful whenever one has additional 
knowledge about the graph structure. For example, we experimentally exploit this property in Sec- 
tion 4.2 to privilege or penalize paths g in Q p starting from a particular vertex. This is illustrated 
in Figure 3b where the cost on the arc (s, 1) is much smaller than on the arcs (5, 2), (s,3), (s,4), 
therefore encouraging paths starting from vertex 1. 

Another interpretation connecting the path-coding penalties with coding lengths and random 
walks can be drawn using information theoretic arguments derived from Huang et al. (2011). We 
find these connections interesting, but for simplicity only present them in Appendix C. In the next 
sections, we address the following issues: (i) how to compute the penalties q>g and \\Tg given a 
vector w in M p ? (ii) how to optimize the objective function (1)? (hi) in the convex case (when Q. = 
i\fg p ), can we obtain practical optimality guarantees via a duality gap? All of these questions will be 
answered using network flow and convex optimization, or algorithmic tools on graphs. 
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1: 



(a) Graph G' with arcs costs and a path g in bold red. 



t 



(b) Graph G' with different arcs costs and a path g. 



Figure 3: (a) G 1 is obtained by adding a source s and sink t to a DAG with four nodes. The cost 
configuration is such that the weights r| g satisfy r| g = y+ \g\. For example, for g = (4, 2, 3), the sum 
of costs along (s,g,t) is r\ g = y+3. (b) Same graph G' as (a) but with different costs. The weight r\ g 
associated to the path g = (1,2) is the sum of costs along (s, l,2,f) — that is, r\ g = 4. 



3.2 Flow Definitions of the Path Coding Penalties 

Before precisely stating the flow definitions of q>g and tyg , let us sketch the main ideas. The 
first key component is to transform the optimization problems (4) and (6) over the paths in G into 
optimization problems over (s,t) -path flows in G' . We recall that (s^-path flows are defined as 
flow vectors carrying the same positive value on every arc of a path between s and t. It intuitively 
corresponds to sending from s to t a positive amount of flow along a path, an interpretation we 
have presented in Figure 2 from Section 2.1. Then, we use the flow decomposition theorem (see 
Section 2.1), which provides two equivalent viewpoints for solving a minimum cost flow problem 
on a DAG. One should be either looking for the value f m , of a flow on every arc (u,v) of the graph, 
or one should decide how much flow should be sent on every (s,?)-path. 

We assume that a cost configuration [c MV ](H,v)eE' * s available and that the weights r\ g are defined 
according to Equation (8). We denote by J the set of flows on G'. The second key component of 
our approach is the fact that the cost of a flow / in ^ sending one unit from s to t along a path g in G, 
defined as Y.(u,v)€E' fw°uv = Y,(u,v)<={s,g,t) c uv is exactly r\ g , according to Equation (8). This enables 
us to reformulate our optimization problems (4) and (6) on paths in G as optimization problems 
on (s,f)-path flows in G', which in turn are equivalent to minimum cost flow problems and can be 
solved in polynomial time. Note that this equivalence does not hold when we have cycle flows (see 
Figure 2d), and this is the reason why we have assumed G to be acyclic. 

We can now formally state the mappings between the penalties cp^ and \\Tg on one hand, and 
network flows on the other hand. An important quantity in the upcoming propositions is the amount 
of flow going through a vertex j in V = {1, . . . ,p}, which we denote by 

Sjif) = I U 
ueV:(u,j)eE' 
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Our formulations involve capacity constraints and costs for sj(f), which can be handled by network 
flow solvers; in fact, a vertex can always be equivalently replaced in the network by two vertices, 
linked by an arc that carries the flow quantity sj(f) (Ahuja et al., 1993). The main propositions are 
presented below, and the proofs are given in Appendix D. 

Proposition 1 (Computing (pg .) 

Let w be in W. Consider the network G' defined in Section 3.1 with costs [c U v](u,v)eE'> an d define r\ g 
as in (8). Then, 




(f>£ ( ,(w) = miiW £ f uv c uv s.t. Sj(f)>\, Vj G Supp(w) \ , (9) 



where J is the set of flows on G 1 . This is a minimum cost flow problem with some lower-capacity 
constraints, which can be computed in strongly polynomial time. 4 

Given the definition of the penalty cp^ in Eq. (5), computing cpg seems challenging for two reasons: 

(i) Eq. (5) is for a general group structure Q a NP-hard Boolean linear program with \ Q\ variables; 

(ii) the size of Q p is exponential in the graph size. Interestingly, Proposition 1 tells us that these two 
difficulties can be overcome when Q = Q p and that the non-convex penalty cpg can be computed in 
polynomial time by solving the convex optimization problem defined in Eq. (9). The key component 
to obtain the flow definition of q>g is the decomposability property of the weights defined in (8). 
This allows us to identify the cost of sending one unit of flow in G' from s to t along a path g to 
the cost of selecting the path g in the context of the path coding penalty (pg . We now show that the 
same methodology applies to the convex penalty \|/g . 

Proposition 2 (Computing \\fg .) 

Let w be in W. Consider the network G' defined in Section 3.1 with costs [cuv](u,v)eE'> an d define r\ g 
as in (8). Then, 

¥fi,(w)=min< £ f uv c uv t. Sj C/)>|w J -| ) vy€{i ) ...,p}i ) (io) 



.("jV)GB' 



where f is the set of flows on G'. This is a minimum cost flow problem with some lower-capacity 
constraints, which can be computed in strongly polynomial time. 

From the similarity between Equations (9) and (10), it is easy to see that \\fg and cpg are closely 
related, one being a convex relaxation of the other as explained in Section 3.1. We have shown 
here that q>g and \\fg can be computed in polynomial time and will discuss in Section 3.4 practical 
algorithms to do it in practice. Before that, we address the problem of optimizing (1). 

3.3 Using Proximal Gradient Methods with the Path Coding Penalties 

To address the regularized problem ( 1), we use proximal gradient methods, which we have presented 
in Section 2.2. We need for that to compute the proximal operators of cpg and \\fg from Definition 1. 
We show that this operator can be efficiently computed using network flow optimization. 



4. See the definition of "strongly polynomial time" in Section 2.1. 
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Proposition 3 (Computing the Proximal Operator of cp^ .) 

Let u be in W. Consider the network G' defined in Section 3.1 with costs [cuv](u,v)eE'> an d define T\ g 
as in (8). Let us define 

reargmini £ f w c uv + £ -max (uj(l - Sj(f)),0) \ , (11) 
/e5 {( u ,v)eE> j=i z J 

where 'J is the set of flows on G'. This is a minimum cost flow problem, with piecewise linear costs, 
which can be computed in strongly polynomial time. Denoting by w* =Prox^^ p [u], we have for all 
j in V = {1, . . . ,p} that Wy = Uj if Sf{f*) > and otherwise. 

Note that even though the formulation (3) is non-convex when Q. is the function cp^ , its global 
optimum can be found by solving the convex problem described in Equation (11). As before, the 
key component to establish the mapping to a network flow problem is the decomposability property 
of the weights r\ g . More details are provided in the proofs of Appendix D. Note also that any 
minimum cost flow problem with convex piecewise linear costs can be equivalently recast as a 
classical minimum cost flow problem with linear costs (see Ahuja et al., 1993), and therefore the 
above problem can be solved in strongly polynomial time. We now present similar results for \\fg . 

Proposition 4 (Computing the Proximal Operator of .) 

Let u be in M p . Consider the network G' defined in Section 3.1 with costs [c U v](u,v)eE'> an d define r\ g 
as in (8). Let us define 

r p i 2 \ 

/*€argmnW £ fu V c uv + £ ^max (\uj\ -sj(f),6) >, (12) 
/eJ {(u,v)eE' j=i z J 

where J is the set of flows on G '. This is a minimum cost flow problem, with piecewise quadratic 
costs, which can be computed in polynomial time. Denoting by w* = Proxy^ [u], we have for all j 
in V ={!,..., p}, w* = sign(uj)min(|u ; -|,^(/*)). 

The proof of this proposition is presented in Appendix D. We remark that we are dealing in Propo- 
sition 4 with a minimum cost flow problem with quadratic costs, which is more difficult to solve 
than when the costs are linear. Such problems with quadratic costs can be solved in weakly (in- 
stead of strongly) polynomial time (see Hochbaum, 2007) — that is, a time polynomial in \E\ 
and log(||u||oo/e) to obtain an £-accurate solution to problem (12), where £ can possibly be set 
to the machine precision. We have therefore shown that the computations of (pg p , ^fg p , Prox^ 
and Prox v ^ can be done in polynomial time. We now discuss practical algorithms, which have 
empirically shown to be efficient and scalable (Goldberg, 1997; Bertsekas, 1998). 

3.4 Practical Algorithms for Solving the Network Flow Problems 

The minimum cost flow problems involved in the computations of q>g p , \\fg p and ProX(p ft can be 
solved in the worst-case with 0((|V| log |V|)(|iJ| + | V | log \V\)) operations (see Ahuja et al., 1993). 
However, this analysis corresponds to the worst-case possible and the empirical complexity of net- 
work flow solvers is often much better (Boykov et al., 2001). Instead of a strongly polynomial 
algorithm, we have chosen to implement the scaling push-relabel algorithm (Goldberg, 1997), also 
known as an £-relaxation method (Bertsekas, 1998). This algorithm is indeed empirically efficient 
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despite its weakly polynomial worst-case complexity. It requires transforming the capacities and 
costs of the minimum cost flow problems into integers with an appropriate scaling and rounding 
procedure, and denoting by C the (integer) value of the maximum cost in the network its worst-case 
complexity is f?(|V| 2 |£'|log(C|\ / |)). This algorithm is appealing because of its empirical efficiency 
when the right heuristics are used (Goldberg, 1997). We choose C to be as large as possible (using 
64 bits integers) not to lose numerical precision, even though choosing C according to the desired 
statistical precision and the robustness of the proximal gradient algorithms would be more appro- 
priate. It has indeed been shown recently by Schmidt et al. (2011) that proximal gradient methods 
for convex optimization are robust to inexact computations of the proximal operator, as long as the 
precision of these computations iteratively increases with an appropriate rate. 

Computing the proximal operator Prox Vft [u] requires dealing with piecewise quadratic costs, 
which are more complicated to deal with than linear costs. Fortunately, cost scaling or £-relaxation 
techniques can be modified to handle any convex costs, while keeping a polynomial complex- 
ity (Bertsekas, 1998). Concisely describing £-relaxation algorithms is difficult because their con- 
vergence properties do not come from classical convex optimization theory. We present here an 
interpretation of these methods, but we refer the reader to Chapter 9 of Bertsekas (1998) for more 
details and implementation issues. In a nutshell, consider a primal convex cost flow problem 
min^ e j^( u v ) e£ C MV (/ l(V ), where the functions C uv are convex, and without capacity constraints. 
Using classical Lagrangian duality, it is possible to obtain the following dual formulation 

max V q uv (% u - ji v ) , where q m {% u - % v ) = min [C m (f uv ) - (%„ - TZ v )f uv ] ■ (13) 

This formulation is unconstrained and involves for each node uinVa dual variable n u in R, which 
is called the price of node u. £-relaxation techniques rely on this dual formulation, and can be 
interpreted as approximate dual coordinate ascent algorithms. They exploit the network structure to 
perform computationally cheap updates of the dual and primal variables, and can deal with the fact 
that the functions q uv are concave but not differentiable in general. Presenting how this is achieved 
exactly would be too long for this paper; we instead refer the reader to Chapter 9 of Bertsekas 
(1998). In the next section, we introduce algorithms to compute the dual norm of \\fg ■ , which is an 
important quantity to obtain optimality guarantees for (1) with CI = \\fg , or for implementing active 
set methods that are adapted to very large-scale very sparse problems (see Bach et al., 2012). 

3.5 Computing the Dual-Norm of Vfg p 

The dual norm of the norm is defined for any vector K in W as \|/^ (k) = max^ ft 7 ( w )<i w T K 
(see Boyd and Vandenberghe, 2004). We show in this section that can be computed efficiently. 

Proposition 5 (Computing the Dual Norm y*g .) 

Let K be in W. Consider the network G' defined in Section 3.1 with costs [c U v](u,v)eE'> an d define r\ g 
as in (8). For X > 0, and all j in {1, . . . ,/?}, we define an additional cost for the vertex j to be 
— |Ky|/x. We then define for every path g in Q p , the length li(g) to be the sum of the costs along the 
corresponding (s,t)-pathfrom G'. Then, 

= min i x s.t. min l x (g) > I , 
y ' -CGIR+ [ geg p J 

and\\f*g p (K) is the smallest factor X such that the shortest (s,t)-path on G' has nonnegative length. 
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The proof is given in Appendix D. We note that the above quantity U{g) satisfies l%(g) = T[ g — 
1 1 Kg || i /x, for every x > and K in W '. We present a simple way for computing \|/^ in Algorithm 1, 
which is proven in Proposition 6 to be correct and to converge in polynomial time. 

Algorithm 1 Computation of the Dual Norm \|/^ . 
input K € M p such that K ^ 0. 



Choose any path g € Q p such that K g ^ 0; 

8 < °°; 

while 8 < do 

g <— argmin fte ^ h{h); (shortest path problem in a directed acyclic graph); 

8- k(g)-> 
end while 

Return: x = \|/^ p (k) (value of the dual norm). 



Proposition 6 (Correctness and Complexity of Algorithm 1.) 

For K in W\ the algorithm 1 computes (k) in at most 0(p\E'\) operations. 

The proof is also presented in Appendix D. We note that this worst-case complexity bound might 
be loose. We have indeed observed in our experiments that the empirical complexity is close to be 
linear in \E'\. To concretely illustrate why computing the dual norm can be useful, we now give 
optimality conditions for problem (1) involving \|/^ . The following lemma can immediately be 
derived from (Bach et al., 2012, Proposition 1.2). 

Lemma 2 (Optimality Conditions for Problem (1) with CI = \|/^.) 

A vector w be in W is optimal for problem (1) with Q. = \|/^ if and only if 

\|^(VL(w)) <X and - VL(w) T w = A,\|^(w). 

The next section presents an active-set type of algorithm (see Bach et al., 2012, Chapter 6) building 
upon these optimality conditions and adapted to our penalty \\Tg . 

3.6 Active Set Methods for Solving Problem (1) when D. = \\fg 

As experimentally shown later in Section 4, proximal gradient methods allows us to efficiently 
solve medium-large/scale problems (p < 100000). Solving larger scale problems can, however, 
be more difficult. Algorithm 2 is an active-set strategy that can overcome this issue when the 
solution is very sparse. It consists of solving a sequence of smaller instances of Equation (1) on 
subgraphs G = (V,E), with V QV and E C E, which we call active graphs. It is based on the 
computation of the dual-norm \y*g which we have observed can empirically be obtained in a time 

linear or close to linear in \E'\. Given such a subgraph G, we denote by Q p the set of paths in G. 
The subproblems the active set strategy involve are the following: 

mm (L(w) + h\fg p (yf) s.t. wj = for aU j <£ V}. (14) 
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The key observations are that (i) when G is small, subproblem (14) is easy to solve; (ii) after solv- 
ing (14), one can check optimality conditions for problem (1) using Lemma 2, and update G accord- 
ingly. Algorithm 2 presents the full algorithm, and the next proposition ensures that it is correct. 



Algorithm 2 Active-set Algorithm for Solving Equation (1) with Q. = \\fg . 

1: Initialization w +- 0; G «— (0,0) (active graph); 

2: loop 

3: Update w by solving subproblem (14) (using the current value of w as a warm start); 
4: Compute x <— \\f*g (VL(w)) (using Algorithm 1); 
5: if x < X then 
6: exit the loop; 
7: else 

8: g <— argmin^^ l T (g) (shortest path problem in a directed acyclic graph); 

9: V ^- V Ug; E ■(r- E L) {(u,v) £ E : u £ g and v G g} (update of the active graph); 
10: end if 

ll: end loop 

12: Return: w* +- w, solution to Equation (1). 



Proposition 7 (Correctness of Algorithm 2.) 

Algorithm 2 solves Equation (1) when D. = \|/^ p . 

The proof is presented in Appendix D. It mainly relies on Lemma 2, which requires computing the 
quantity \|/^ (VL(w)). More precisely, it can be shown that when w is a solution to subproblem (14) 
for a subgraph G, whenever \|/^ (VL(w)) < X, it is also a solution to the original large problem (1). 
Note that variants of Algorithm 2 can be considered: one can select more than a single path g to 
update the subgraph G, or one can approximately solve the subproblems (14). In the latter case, the 
stopping criterion could be relaxed in practice. One could use x < X + e, where £ is a small positive 
constant, or one could use a duality gap to stop the optimization when the solution is guaranteed to 
be optimal enough. 

In the next section, we present various experiments, illustrating how the different penalties and 
algorithms behave in practice. 

4. Experiments and Applications 

We now present experiments on synthetic, genomic and image data. Our algorithms have been im- 
plemented in C++ with a Matlab interface, they have been made available as part of the open-source 
software package SPAMS, originally accompanying Mairal et al. (2010). 5 We have implemented the 
proximal gradient algorithm FISTA (Beck and Teboulle, 2009) for convex regularization functions 
and ISTA for non-convex ones. When available, we use a relative duality gap as a stopping criterion 
and stop the optimization when the relative duality gap is smaller than 10~ 4 . In our experiments, 
we often need to solve Equation (1) for several values of the parameter X, typically chosen on a 
logarithmic grid. We proceed with a continuation strategy: first we solve the problem for the largest 

5. http : // spams -devel . gf orge .inria.fr/. 
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value of A-, whose solution can be shown to be when A. is large enough; then we decrease the value 
of X, and use the previously obtained solution as initialization. This warm restart strategy allows 
us to quickly follow a regularization path of the problem. For non-convex problems, it provides us 
with a good initialization for a given X. The algorithm ISTA with the non-convex penalty cpg is 
indeed only guaranteed to iteratively decrease the value of the objective function. As often the case 
with non-convex problems, the quality of the optimization is subject to a good initialization, and 
this strategy has proven to be important to obtain good results. 

As far as the choice of the parameters is concerned, we have observed that all penalties we have 
considered in our experiments are very sensitive to the regularization parameter X. Thus, we use 
a fine grid to choose X using cross-validation or a validation set. Some of the penalties involve an 
extra parameter, yin the case of (p« and . This parameter offers some flexibility, for example it 
promotes the connectivity of the solution for q>g and tyg , but it also requires to be tuned correctly 
to prevent overfitting. In practice, we have found the choice of the second parameter always less 
critical than X to obtain a good prediction performance, and thus we use a coarse grid to choose this 
parameter. All other implementation details are provided in each experimental section. 

4.1 Synthetic Experiments 

In this first experiment, we study our penalties cp^ and \\fg in a controlled setting. Since generating 
synthetic graphs reflecting similar properties as real-life networks is difficult, we have considered 
three "real" graphs of different sizes, which are part of the 10 th DIMACS graph partitioning and 
graph clustering challenge: 6 

• the graph jazz was compiled by Gleiser and Danon (2003) and represents a community net- 
work of jazz musicians. It contains p = 198 vertices and m = 2742 edges; 

• the graph email was compiled by Guimera et al. (2003) and represents e-mail exchanges in a 
university. It contains p = 1 133 vertices and m = 5451 edges; 

• the graph PGP was compiled by Boguna et al. (2004) and represents information interchange 
among users in a computer network. It contains p = 10680 vertices and m = 24316 edges. 

We choose an arbitrary topological ordering for all of these graphs, orient the arcs according to 
this ordering, and obtain DAGs. 7 We generate structured sparse linear models with measurements 
corrupted by noise according to different scenarii, and compare the ability of different regularization 
functions to recover the noiseless model. More precisely, we consider a design matrix X in W ixp 
with less observations than predictors (n = [p/2\), and whose entries are i.i.d. samples from a 
normal distribution f7\^(0, 1). For simplicity, we preprocess each column of X by removing its mean 
component and normalize it to have unit ^-norm. Then, we generate sparse vectors wo with k 
non-zero entries, according to different strategies which are described in the sequel. We synthesize 
an observation vector y = Xwo + £ in W, where the entries of e are i.i.d. draws from a normal 
distribution fA£(0, y/k/na), with different noise levels: 

• high SNR: we choose o = 0.2; this yields a signal noise ratio (SNR) j | Xwo 1 1 1 / 1 1 e 1 1 2 °f arx) ut 26. 
We note that for a < 0. 1 almost all penalties almost perfectly recover the true pattern; 

6. http : / /www . cc . gatech . edu/dimacs 10 /archive /clustering. shtml. 

7. A topological ordering ■< of vertices in a directed graph is such that if there is an arc from vertex u to vertex v, 
then u -< v. A directed graph is acyclic is and only if it possesses a topological ordering (see Ahuja et al., 1993). 
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• medium SNR: for a = 0.4, the SNR is about 6; 

• low SNR: for a = 0.8, the SNR is about 1.6. 

Choosing a good criterion for comparing different penalties is difficult, and a conclusion drawn 
from an experiment is usually only valid for a given criterion. For example, we present later an 
image denoising experiment in Section 4.2, where non-convex penalties outperform convex ones 
according to one performance measure, while being the other way around for another one. In 
this experiment, we choose the relative mean square error ||Xw — XwoHl as a criterion, and use 
ordinary least square (OLS) to refit the models learned using the penalties. Whereas OLS does not 
change the results obtained with the non-convex penalties we consider, it changes significantly the 
ones obtained with the convex ones. In practice, OLS counterbalances the "shrinkage" effect of 
convex penalties, and empirically improves the results quality for low noise regimes (high SNR), 
but deteriorates it for high noise regimes (low SNR). 

For simplicity, we also assume (in this experiment only) that an oracle gives us the optimal reg- 
ularization parameter A,, and therefore the conclusions we draw from the experiment are only the 
existence or not of good solutions on the regularization path for every penalty. A more exhaustive 
comparison would require testing different combinations (with OLS, without OLS) and different cri- 
teria, and using internal cross-validation to select the regularization parameters. This would require 
a much heavier computational setting, which we have chosen not to implement in this experiment. 
After obtaining the matrix X, we propose several strategies to generate "true" models wo: 

• in the scenario flat we randomly select k entries without exploiting the graph structure; 

• the scenario graph consists of randomly selecting 5 entries, and iteratively selecting new 
vertices that are connected in G to at least one previously selected vertex. This produces 
fairly connected sparsity patterns, but does not exploit arc directions; 

• the scenario path is similar to graph, but we iteratively add new vertices following single 
paths in G. It exploits arc directions and produces sparsity patterns that can be covered by a 
small number of paths, which is the sort of patterns that our path-coding penalties encourage. 

The number of non-zero entries in wo is chosen to be k = [0-lpJ f° r the different graphs, resulting 
in a fairly sparse vector. The values of the non-zero entries are randomly chosen in {— 1, + 1}. We 
consider the formulation (1) where L is the square loss: L(w) = ^ ||y — Xw||| and £1 is one of the 
following penalties: 

• the classical £q- and t\ -penalties; 

• the penalty \|/^ of Jacob et al. (2009) where the groups Q are pairs of vertices linked by an arc; 

• our path-coding penalties ($>g p or \\fg with the weights r\ g defined in (7). 

• the penalty of Huang et al. (201 1), and their strategy to encourage sparsity pattern with a small 
number of connected components. We use their implementation of the greedy algorithm 
StructOMP 8 . This algorithm uses a strategy dubbed "block-coding" to approximately deal 
with this penalty (see Huang et al., 2011), and has an additional parameter, which we also 
denote by /u, to control the trade-off between sparsity and connectivity. 

8. http : //ranger . uta . edu/~huang/R_StructuredSparsity . htm 
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For every method except StructOMP, the regularization parameter X is chosen among the values 2'/ 4 , 
where i is an integer. We always start from a large value for i, and decrease its value by one, 
following the regularization path. For the penalties ($g and \\Tg ■ , the parameter yis simply chosen 
in {1/4, 1 /2, 1,2,4}. Since the algorithm StructOMP is greedy and iteratively increases the model 
complexity, we record every solution obtained on the regularization path during one pass of the 
algorithm. Based on information-theoretic arguments, Huang et al. (2011) propose a default value 
for their parameter fi = 1. Since changing this parameter value empirically improves the results 
quality, we try the values {1/4,1/2,1,2,4} for a fair comparison with our penalties ($>g p and . 

We report the results for the three graphs, three scenarii for generating wo, three noise levels 
and the five penalties in Figure 4. We report on these graphs the ratio between the prediction 
mean square error and the best achievable error if the sparsity pattern was given by an oracle. In 
other words, denoting by w oracle the OLS estimate if an oracle gives us the sparsity pattern, we 
report the value ||Xw — Xwo||2/||Xw oracle — Xwoll^- The rjest achievable value for this criterion is 
therefore 1 , which is represented by a dotted line on all graphs. We reproduce the experiment 20 
times, randomizing every step, including the way the vector wo is generated to obtain error bars 
representing one standard deviation. 

We make pairwise comparisons and statistically assess our conclusions using error bars or, when 
needed, paired one-sided T-tests with a 1% significance level. The comparisons are the following: 

• convex vs non-convex (£ vs £\ and ($>g p vs Vfg p )'. For high SNR, non-convex penalties do 
significantly better than convex ones, whereas it is the other way around for low SNR. The dif- 
ferences are highly significant for the graphs email and PGP. For medium SNR, conclusions 
are mixed, either the difference between a convex penalty and its non-convex counterpart are 
not significant or one is better than another. 

• unstructured vs path-coding (£q vs ($>g p and £\ vs ^g p Y- In the structured scenarii graph 
and path, the structured penalties q>g p and \\fg respectively do better than £q and i\. In the 
unstructured flat scenario, £q and £\ should be preferred. More precisely, for the scenarii 
graph and path, tyg and \\fg respectively outperform £q and £\ with statistically significant 
differences, except (i) for high SNR, both (pg and £o achieve perfect recovery; (ii) with the 
smallest graph jazz, the p- values obtained to compare vs £\ are slightly above our 1% 
significance level. In the flat scenario, £q and (pg t give similar results, whereas \|/^ performs 
slightly worse than l\ in general even though they are very close. 

• Jacob et al. (2009) vs path-coding (\pg with pairs vs Vfg t )'- in the scenario path, \p~g out- 
performs \\fg (pairs). It is generally also the case in the scenario graph. The differences are 
always significant in the low SNR regime and with the largest graph PGP. 

• Huang et al. (2011) vs path-coding (StructOMP vs <pg p ,\pg p ): For the scenario path, either 
(pg (for high SNR) or \\tg (for low SNR) outperform StructOMP. For the scenario graph, 
the best results are shared between StructOMP and our penalties for high and medium SNR, 
and our penalties do better for low SNR. More precisely in the scenario graph: (i) there is no 
significant difference for high SNR between (pg p and StructOMP; (ii) for medium SNR, Struc- 
tOMP does slightly better with the graph PGP and similarly as cp^ for the two other graphs; 
(iii) for low SNR, our penalties do better than StructOMP with the two largest graphs email 
and PGP and similarly with the smallest graph jazz. 
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To conclude this experiment, we have shown that our penalties offer a competitive alternative to 
StructOMP and the penalty of Jacob et al. (2009), especially when the "true" sparsity pattern is 
exactly a union of a few paths in the graph. We have also identified different noise and size regimes, 
where a penalty should be preferred to another. Our experiment also shows that having both a non- 
convex and convex variant of a penalty can be interesting. In low SNR, convex penalties are indeed 
better behaved than non-convex ones, whereas it is the other way around when the SNR is high. 
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Figure 4: Every bar represents the ratio between the mean square error estimate and the oracle mean 
square error estimate (see main text for an explicit formula and the full experimental setting). The 
error bars represent one standard deviation. Each row corresponds to a specific noise level, and 
every column to a different graph. For a specific noise level and specific graph, the results for three 
scenarii, flat, graph and path are reported. Each group of six bars represents the results obtained 
with six penalties, from left to right: £q, t\, \\fg (with Q being the pairs of vertices linked by an arc), 
(pg p , \\fg ■ , and the method StructOMP. A legend is presented in the top right figure. 
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4.2 Image Denoising 

State-of-the-art image restoration techniques are often based on a good model for small image 
patches, of size for instance 10 x 10 pixels (Elad and Aharon, 2006; Dabov et al., 2007; Mairal 
et al., 2009). We consider here the task of denoising natural images corrupted by white Gaussian 
noise, using an approach introduced by Elad and Aharon (2006). It consists of the following steps: 

1. extract all overlapping patches (y');=i,...,m from a noisy image; 

2. compute a sparse approximation of every individual patch y ! : 

1 



mm 



v y'-xw''i+xn(w ! ') 



(15) 



where the matrix X = [x 1 ,...^] in M" xp is a predefined "dictionary", XCl is a sparsity- 
inducing regularization and the term Xw' is the clean estimate of the noisy patch y' ; 

3. since the patches overlap, each pixel admits several estimates. The last step consists of aver- 
aging the estimates of each pixel to reconstruct the full image. 

Whereas Elad and Aharon (2006) learn an overcomplete basis set to obtain a "good" matrix X 
in the step 2 above, we choose a simpler approach and use an orthogonal discrete cosine trans- 
form (DCT) dictionary X (Ahmed et al., 1974) for which there exists a natural directed acyclic 
graph structure. Such dictionary is classically used in the image processing literature (see Elad and 
Aharon, 2006); we present such a dictionary in Figure 5 for 8 x 8 image patches. DCT elements can 
be organized on a two-dimensional grid and ordered by horizontal and vertical frequencies. We use 
the DAG structure presented in Figure 5 connecting neighbors on the grid, going from low to high 
frequencies. Note that since the dictionary X is orthogonal, the non-convex problems we address 
here are solved exactly. We this experiment, we address the following questions: 

(A) In terms of optimization, is our approach efficient for this experiment? Because the number 
of problems to solve is large (several millions), the task is difficult. 

(B) Do we get better results by using the graph structure than with classical and l\-penalties? 

(C) How does the method compare with the state of the art? 

Since the dictionary X in W xp is orthogonal, it can be shown that Equation (15) is equivalent to 

1 



mm 



^llxV-w'ni+xnCw 1 ') 



and therefore the solution admits a closed form w 1 * = Prox^(X T y ! ). For £q and i\, the solution 
is indeed respectively obtained by hard and soft-thresholding, and we have introduced some tools 
in Section 3 to compute the proximal operators of q>g and \|/^ ;) . We consider ex e image patches, 
with e S {6,8,10,12,14,16}, and a parameter X on a logarithmic scale with step 2 1 / 8 . We also 
exploit the variant of our penalties presented in Section 3 that allows us to choose the costs on the 
arcs of the graph G'. We choose here a small cost on the arc (s, 1) of the graph G', and a large one 
for every arc (s,j), for j in {2, . . . ,p}, such that all paths selected by our approach are encouraged to 
start by the variable 1 (equivalently the dictionary element x 1 with the lowest frequencies). We use 
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Figure 5: Orthogonal DCT dictionary with « = 8 x 8 image patches. The dictionary elements are 
organized by horizontal and vertical frequencies. 



a dataset of 12 classical high-quality images (uncompressed and free of artifact). We optimize the 
parameters X and e on the first 3 images, keeping the 9 last images as a test set and report denoising 
results on Table 1. Even though this dataset is relatively small, it is standard in the image processing 
literature, making the comparison easy with competitive approaches. 9 

We start by answering question (A): we have observed that the time of computation depends on 
several factors, including the problem size and the sparsity of the solution (the sparser, the faster). 
In the setting o= 10 and e = 8 we are able to denoise approximately 4000 patches per second 
using yg p , and 1 800 in the setting G = 50 and e=l4 with our laptop 1.2Ghz CPU (core i3 330UM). 
The penalty \\fg requires solving quadratic minimum cost flow problems, and is slower to use 
in practice: The numbers 4000 and 1 800 above respectively become 70 and 130. Our approach 
with q>g proves therefore to be fairly efficient for our task, allowing us to process an image with 
about 250000 patches in between one and three minutes. As expected, simple penalties are faster 
to use: about 65000 patches per second can be processed using £q. 

Moving now to question (B), the best performance among the penalties £o, £\, (^>g > and Vfg is 
obtained by q>g . This difference is statistically significant: We measure for instance an average 
improvement of 0.38 ±0.21 dB over £q for a > 20. For this denoising task, it is indeed typical 
to have non-convex penalties outperforming convex ones (see Mairal, 2010, Section 1.6.5, for a 
benchmark between £o and l\ -penalties), and this is why the original method of Elad and Aharon 
(2006) uses the £o-penalty. Interestingly, this superiority of non-convex penalties in this denoising 
scheme based on overlapping patches is usually only observed after the averaging step 3. When one 
considers the quality of the denoising of individual patches without averaging — that is, after step 2, 
opposite conclusions are usually drawn (see again Mairal, 2010, Section 1.6.5). We therefore report 
mean-square error results for individual patches without averaging in Table 2 when e = 10. As 
expected, the penalty \\Tg turns out to be the best at this stage of the algorithm. Note that the bad 
results obtained by convex penalties after the averaging step are possibly due to the shrinkage effect 
of these penalties. It seems that the shrinkage is helpful for denoising individual patches, but hurts 
after the averaging process. 



9. This dataset can be found for example in Mairal et al. (2009). 
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We also present the performance of state-of-the-art image denoising approaches in Table 1 to 
address question (C). We have chosen to include in the comparison several methods that have suc- 
cessively been considered as the state of the art in the past: the Gaussian Scale Mixture (GSM) 
method of Portilla et al. (2003), the K-SVD algorithm of Elad and Aharon (2006), the BM3D 
method of Dabov et al. (2007) and the sparse coding approach of Mairal et al. (2009) (LSSC). We 
of course do not claim to do better than the most recent approaches of Dabov et al. (2007) or Mairal 
et al. (2009), which in addition to sparsity exploit non-local self similarities in images (Buades et al., 
2005). Nevertheless, given the fact that we use a simple orthogonal DCT dictionary, unlike Elad 
and Aharon (2006) who learn overcomplete dictionaries adapted to the image, our approach based 
on the penalty cp^ performs relatively well. We indeed obtain similar results as Elad and Aharon 
(2006) and Portilla et al. (2003), and show that structured parsimony could be a promising tool in 
image processing. 
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State-of-the-art approaches 


Portilla etal.,2003 (GSM) 


36.96 


33.19 


31.17 


29.78 


28.74 


25.67 


22.96 


Elad and Aharon, 2006 (K-SVD) 


37.11 


33.28 


31.22 


29.81 


28.72 


25.29 


22.02 


Dabov et al., 2007 (BM3D) 


37.24 


33.60 


31.68 


30.36 


29.36 


26.11 


23.11 


Mairal etal., 2009 (LSSC) 


37.29 


33.64 


31.70 


30.36 


29.33 


26.20 


23.20 



Table 1: Denoising results on the 9 test images. The numbers represent the average PSNR in 
dB (higher is better). Denoting by MSE the mean-squared-error for images whose intensities are 
between and 255, the PSNR is defined as PSNR = 101og 10 (255 2 /MSE). Pixel values are scaled 
between and 255 and o (the standard deviation of the noise) is between 5 and 100. The top part 
of the table presents the results of the denoising scheme we have obtained with different penalties. 
The bottom part presents the results obtained with various state-of-the-art denoising methods (see 
main text for more details). Best results are in bold for both parts of the table. 

4.3 Breast Cancer Data 

One of our goal was to develop algorithmic tools improving the approach of Jacob et al. (2009). It 
is therefore natural to try one of the dataset they used to obtain an empirical comparison. On the 
one hand, we have developed tools to enrich the group structure that the penalty \|/^ could handle, 
and thus we expect better results. On the other hand, the graph in this experiment is undirected and 
we need to use heuristics to transform it into a DAG. 

We use in this task the breast cancer dataset of Van De Vijver et al. (2002). It consists of gene ex- 
pression data from 8 141 genes in n = 295 breast cancer tumors and the goal is to classify metastatic 
samples versus non-metastatic ones. Following Jacob et al. (2009), we use the gene network com- 
piled by Chuang et al. (2007), obtained by concatenating several known biological networks. As 
argued by Jacob et al. (2009), taking into account the graph structure into the regularization has two 
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Table 2: Denoising results for individual 10 x 10 image patches on the 9 test images. The numbers 
represent mean-squared error for the image patches (lower the better). Best results are in bold. 

objectives: (i) possibly improving the prediction performance by using a better prior; (ii) identifying 
connected subgraphs of genes that might be involved in the metastatic form of the disease, leading 
to results that yield better interpretation than the selection of isolated genes. Even though prediction 
is our ultimate goal in this task, interpretation is equally important since it is necessary in practice to 
design drug targets. In their paper, Jacob et al. (2009) have succeeded in the sense that their penalty 
is able to extract more connected patterns than the ^i-regularization, even though they could not 
statistically assess significant improvements in terms of prediction. Following Jacob et al. (2009), 
we also assume that connectivity of the solution is an asset for interpretation. The questions we 
address here are the following: 

(A) Despite the heuristics described below to transform the graph into a DAG, does our approach 
lead to well-connected solutions in the original (undirected) graph? Do our penalties lead to 
better-connected solutions than other penalties?. 

(B) Do our penalties lead to better classification performance than Jacob et al. (2009) and other 
classical unstructured and structured regularization functions? Is the graph structure useful 
to improve the prediction? Does sparsity help prediction? 

(C) How efficient is our approach for this task? The problem here is of medium/large scale but 
should be solved a large number of times (several thousands of times) because of the internal 
cross validation procedure. 

The graph of genes, which we denote by Go, contains 42587 edges, and, like Jacob et al. (2009), 
we keep the /? = 7910 genes that are present in Go- In order to obtain interpretable results and select 
connected components of Go, Jacob et al. (2009) used their structured sparsity penalty \|/^ where the 
groups Q are all pairs of genes linked by an arc. Our approach requires a DAG, but we will show in 
the sequel that we nevertheless obtain good results after heuristically transforming Go into a DAG. 
To do so, we first treat Go as directed by choosing random directions on the arcs, and second we 
remove some arcs along cycles in the graph. It results in a DAG containing 33 303 arcs, which we 
denote by G. This pre-processing step is of course questionable since our penalties are originally 
not designed to deal with the graph Go- We of course do not claim to be able to individually interpret 
each path selected by our method, but, as we show, it does not prevent our penalties cp^ ; and Vfg to 
achieve their ultimate goal — that is connectivity in the original graph Go- 

We consider the formulation (1) where L is a weighted logistic regression loss: 

^(w)^£-io g (i + e -^ Tx 0, 

■ , n v . 

;=1 yi 
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where the yi's are labels in {— 1,+1}, the x,'s are gene expression vectors in W\ The weight n\ 
is the number of positive samples, whereas «_i the number of negative ones. This model does not 
include an intercept, but the gene expressions are centered. The regularization functions that are 
included in the comparison are the following: 

• our path-coding penalties q>g and \\fg with the weights r\ g defined in (7); 

• the squared £2 -penalty (ridge logistic regression); 

• the ^i-norm (sparse logistic regression); 

• the elastic-net penalty of Zou and Hastie (2005), which has the form w — > || w|| 1 + (fi/2) || w^, 
where n is an additional parameter; 

• the penalty \|/^ of Jacob et al. (2009) where the groups Q are pairs of vertices linked by an arc; 

• a variant of the penalty v/g of Jacob et al. (2009) whose form is given in Equation (17) of 
Appendix B, where the £2 -norm is used in place of the ^-norm; 

• the penalty C,g of Jenatton et al. (201 1) given in Appendix A where the groups are all pairs of 
vertices linked by an arc; 

• the penalty C,g of Jenatton et al. (201 1) using the group structure adapted to DAGs described 
in Appendix A. This penalty has shown to be empirically problematic to use directly. The 
number of groups each variable belongs to significantly varies from a variable to another, 
resulting in overpenalization for some variables and underpenalization for some others. To 
cope with this issue, we have tried different strategies to choose the weights r\ g for every group 
in the penalty, similarly as those described by Jenatton et al. 2011, but we have been unable to 
obtain sparse solutions in this setting (typically the penalty selects here more than a thousand 
variables). A heuristic that has proven to be much better is to add a weighted £\ -penalty to C,g 
to correct the over/under-penalization issue. Denoting for a variable j in {1, ... ,p} by dj the 
number of groups the variable j belongs to — in other words dj = Y*geg-.g3j 1> we a dd the term 
Yfj=\ (max* dk - dj) | w ; - 1 to the penalty t,g. 

The parameter X in Eq. (1) is chosen on a logarithmic scale with steps 2 1 / 4 . The elastic-net pa- 
rameter /u is chosen in {1,10,100}. The parameter y for the penalties ($>g p and is chosen 
in {2,4,8,16}. We proceed by randomly sampling 20% of the data as a test set, keeping 80% 
for training, selecting the parameters X,/u,y using internal 5-fold cross validation on the training set, 
and we measure the average balanced error rate between the two classes on the test set. We have 
repeated this experiment 20 times and we report the averaged results in Table 3. 

We start by answering question (A). We remark that our penalties q>g } and \|/^ ; succeed in 
selecting very few connected components of Go, on average 1.3 for \\fg and 1.6 for q>g p while 
providing sparse solutions. This significantly improves the connectivity of the solutions obtained 
using the approach of Jacob et al. (2009) or Jenatton et al. (201 1). To claim interpretable results, one 
has of course to trust the original graph. As Jacob et al. (2009), we assume that connectivity in Go 
is a prior information. We also study the effect of the preprocessing step we have used to obtain a 
directed acyclic graph G from Go- We report in the row -random" in Table 3 the results we 
obtain when randomizing the pre-processing step between every experimental run (providing us a 
different graph G for every run). We observe that the outcome G does not significantly change the 
sparsity and connectivity in Go of the sparsity patterns our penalty selects. 
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As far as the prediction performance is concerned, \\fg seems to be the only penalty that is able 
to produce sparse and connected solutions while providing a similar average error rate as the £2- 
penalty. The non-convex penalty tyg produces a very sparse solution which is connected as well, but 
with an approximately 6% higher classification error rate. Note that because of the high variability 
of this performance measure, clearly assessing the statistical significance of the observed difference 
is difficult. As it was previously observed by Jacob et al. (2009), the data is very noisy and the 
number of samples is small, resulting in high variability. As Jacob et al. (2009), we have been 
unable to test the statistical significance rigorously — that is, without assuming independence of the 
different experimental runs. We can therefore not clearly answer the first part of question (B). The 
second part of the question is however more clear: neither sparsity, nor the graph structure seems to 
help prediction in this experiment. We have for example tried to use the same graph G, but where 
we randomly permute the p predictors (genes) at every run, making the graph structure irrelevant to 
the data. We report in Table 3 at the row "\|/^, -permute" the average classification error rate, which 
is not significantly different than without permutation. 

Our conclusions about the use of structured sparse estimation for this task are therefore mixed. 
On the one hand, none of the tested method is shown to perform statistically better in prediction 
than simple ridge regularization. On the other hand, our penalty \|/^ ; is the only one that performs 
as well as ridge while selecting a few predictive genes forming a a well connected sparsity pattern. 
According to Jacob et al. (2009), this is a significant asset for biologists, assuming the original graph 
should be trusted. 

Another aspect we would like to study is the stability properties of the selected sparsity patterns, 
which is often an issue with features selection methods (Meinshausen and Biihlmann, 2010). By 
introducing strong prior knowledge in the regularization, structured sparse estimation seems to pro- 
vide more stable solutions than l\. For instance, 5 genes are selected by t\ in more than half of the 
experimental runs, whereas this number is 10 and 14 for the penalties of Jacob et al. (2009), and 33 
for . Whereas we believe that stability is important, it is however hard to claim direct benefits of 
having a "stable" penalty without further study. By encouraging connectivity of the solution, vari- 
ables that are highly connected in the graph tend to be more often selected, improving the stability 
of the solution, but not necessarily its interpretation in the absence of biological prior knowledge 
that prefers connectivity. 

To answer question (C), we study the computational efficiency of our implementation. One it- 
eration of the proximal gradient method for the selected parameters is relatively fast, approximately 
0.17s for \\fg p and 0.15s for q>g p on a 1.2GHz laptop CPU (Intel core i3 330UM), but it tends to 
be significantly slower when the solution is less sparse, for instance with small values for X. Since 
solving an instance of problem (1) requires computing about 500 proximal operators to obtain a 
reasonably precise solution, our method is fast enough to conduct this experiment in a reasonable 
amount of time. Of course, simpler penalties are faster to use. An iteration of the proximal gradient 
method takes about 0.15s for C, g , 0.05s for Jacob et al. (2009), 0.01s for l 2 and 0.003s for i x . 

5. Conclusion 

Our paper proposes a new form of structured penalty for supervised learning problems where pre- 
dicting features are sitting on a DAG, and where one wishes to automatically select a few connected 
subgraphs of the DAG. The computational feasibility of this form of penalty is established by mak- 
ing a new link between supervised path selection problems and network flows. Our penalties admit 
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test error (%) 


sparsity 


connected components 




31.0±6.1 


7910 


58 




36.0 ±6.5 


32.6 


30.9 


ei + £i 


31.5±6.7 


929.6 


355.2 


Jacob et al. (2009)-^ 


35.9 ±6.8 


68.4 


13.2 


Jacob et al. (2009)-^ 2 


36.0±7.2 


58.5 


11.1 


Jenatton et al. (201 1) (pairs) 


34.5 ±5.2 


33.4 


28.8 


Jenatton et al. (201 1) (DAG)+weighted i x 


35.6±7.0 


54.6 


28.4 




36.0±6.8 


10.2 


1.6 




30.2 ±6.8 


69.9 


1.3 


^ -permute 


33.2±7.6 


143.2 


1.7 


-random 


31.6±6.0 


78.5 


1.4 



Table 3: Experimental results on the breast cancer dataset (Van De Vijver et al., 2002) for different 
penalties. Column "test error": average balanced classification error rate on the test set in percents 
with standard deviations; the results are averaged over 20 runs and the parameters are selected for 
each run by internal 5-fold cross validation. Column "sparsity": average number of selected genes. 
Column "connected components": average number of selected connected components in Gq. 



non-convex and convex variants, which can be used within the same network flow optimization 
framework. These penalties are flexible in the sense that they can control the connectivity of a prob- 
lem solution, whether one wishes to encourage large or small connected components, and are able 
to model long-range interactions between variables. 

Some of our conclusions show that being able to provide both non-convex and convex variants 
of the penalties is valuable. In various contexts, we have been able to find situations where convexity 
is helpful, and others where the non-convex approach leads to better solutions than the convex one. 
Our experiments show that when connectivity of a sparsity pattern is a good prior knowledge our 
approach is fast and effective for solving different prediction problems. 

Interestingly, our penalties seem to perform empirically well on more general graphs than 
DAGs, when heuristically removing cycles, and we would like in the future to find a way to better 
handle them. We also would like to make further connections with image segmentation techniques, 
which exploit different but related optimization techniques (see Boykov et al., 2001; Couprie et al., 
2011), and kernel methods, where other types of feature selection in DAGs occur (Bach, 2008). 

Finally, we are also interested in applying our techniques to sparse estimation problems where 
the sparsity pattern is expected to be exactly a combination of a few paths in a DAG. While the 
first version of this manuscript was under review, the first author started a collaboration with com- 
putational biologists to address the problem of isoform detection in RNA-Seq data. In a nutshell, 
a mixture of small fragments of mRNA is observed and the goal is to find a few mRNA molecules 
that explain the observed mixture. In mathematical terms, it corresponds to selecting a few paths 
in a directed acyclic graph in a penalized maximum likelihood formulation. Preliminary results 
obtained by Bernard et al. (2013) confirm that one could achieve state-of-the-art results for this task 
by adapting part of the methodology we have developed in this paper. 
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Appendix A. The Penalty of Jenatton et al. (2011) for DAGs 

The penalty of Jenatton et al. (2011) requires a pre-defined set of possibly overlapping groups Q 
and is defined as follows: 

C$(w)= Lnjwjv, (16) 

where the vector w g in records the coefficients of w indexed by g in the scalars are positive 
weights, and v typically equals 2 or oo. This penalty can be interpreted as the ^-norm of the vector 
[ r lgll w gllv]g££> therefore inducing sparsity at the group level. It extends the Group Lasso (Turlach 
et al., 2005; Yuan and Lin, 2006; Zhao et al., 2009) by allowing the groups to overlap. 

Whereas the penalty \|/^ of Jacob et al. (2009) encourages solutions whose set of non-zero 
coefficients is a union of a few groups, the penalty C,g promotes solutions whose sparsity pattern is 
in the intersection of some selected groups. This subtlety makes these two lines of work significantly 
different. It is for example unnatural to use the penalty C,g to encourage connectivity in a graph. 
When the groups are defined as the pairs of vertices linked by an arc, it is indeed not clear that 
sparsity patterns defined as the intersection of such groups would lead to a well connected subgraph. 
As shown experimentally in Section 4 this setting indeed performs poorly for this task. 

However, when the graph is a DAG, there exists an appropriate group setting § when the sparsity 
pattern of the solution is expected to be a single connected component of the DAG. Let us indeed 
define the groups to be the sets of ancestors, and sets of descendents for every vertex; the set of 
descendents of a vertex u in a DAG are defined as all vertices v such that there exists a path from u 
to v. Similarly the set of ancestors contains all vertices such that there is a path from v to u. The 
corresponding penalty C,g encourages sparsity patterns which are intersections of groups in Q, which 
can be shown to be exactly the connected subgraphs of the DAG. 10 This penalty is tractable since 
the number of groups is linear in the number of vertices, but as soon as the sparsity pattern of the 
solution is not connex (contains more than one connected component), it is unable to recover it, 
making it useful to seek for a more flexible approach. For this group structure Q, the penalty C,g 
also suffers from other practical issues concerning the overpenalization of variables belonging to 
many different groups. These issues are empirically discussed in Section 4 on concrete examples. 

Interestingly, Mairal et al. (201 1) have shown that the penalty C,g with v = °° and any arbitrary 
group structure Q is related to network flows, but for different reasons than the penalties cp^ and 
\[fg . The penalty C,g is indeed unrelated to the concept of graph sparsity since it does not require the 

10. This setting was suggested to us by Francis Bach, Rodolphe Jenatton and Guillaume Obozinski in a private discussion. 
Note that we have assumed here for simplicity that the DAG is connected — that is, has a single connected component. 
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features to have any graph structure. Solving regularized problems with C,g is however challenging, 
and Mairal et al. (2011) have shown that the proximal operator of C,g could be computed by means 
of a parametric maximum flow formulation. It involves a bipartite graph where the nodes represent 
variables and groups, and arcs represent inclusion relations between a variable and a group. Mairal 
et al. (2011) address thus a significantly different problem than ours, even though there is a common 
terminology between their work and ours. 

Appendix B. Links Between Huang et al. (2011) and Jacob et al. (2009) 

Similarly as the penalty of cp^ of Huang et al. (201 1), the penalty of Jacob et al. (2009) encourages 
the sparsity pattern of a solution to be the union of a small number of predefined groups Q. Unlike 
the function q>g, it is convex (it can be shown to be a norm), and is defined as follows: 

V|^(w)^ min J EtiJ^IIv s.t. w= and Vg 6 g, Supp(^) C g\ , (17) 

(^ RP WUee g Tg J 

where ||.|| v typically denotes the ^ 2 -norm (v = 2) or ^-norm (v = °°). In this equation, the vector w 
is decomposed into a sum of latent vectors one for every group g in Q, with the constraint that 
the support of £f is itself in g. The objective function is a group Lasso penalty (Yuan and Lin, 2006; 
Turlach et al., 2005) as presented in Equation (16) which encourages the vectors hf to be zero. As 
a consequence, the support of w is contained in the union of a few groups g corresponding to non- 
zero vectors which is exactly the desired regularization effect. We now give a proof of Lemma 1 
relating this penalty to the convex relaxation of cp^ given in Equation (6) when v = °°. 
Proof. We start by showing that is equal to the penalty \|/^ defined in Equation (6) on We 
consider a vector w in and introduce for all groups g in Q appropriate variables t, 8 in R p . The 
linear program defining \|/^ can be equivalently rewritten 

\]/^(w) = min < r\ T x s.t. £ = w, Nx > £ £f and V g G g, Supp(^ g ) C g 

xeR ] f { geg geg 

where we use the assumption that for all vector w in R+, there exist vectors such that Y,gegk 8 = 
w. Let us consider an optimal pair (x, (t, 8 ) ge g). For all indices j in {1,. .. , p}, the constraint Nx > 
Hgegtf leads to the following inequality 

V v ' V v ' 

ff>0 TT<0 

where x g denotes the entry of x corresponding to the group g, and two new quantities xj and xj are 
defined. For all g in g, we define a new vector £ g such that for every pair (g,j) in g x {1, .. . ,p}: 

1. ifj^,^0; 

2. if j eg and x g > then & = x g ; 
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3. if j e g and x g < then $ 4 £J - ( Xg - ^)|. 

Note that if there exists j and g such that x g < then xj is nonzero and the quantity x+/xj is well 
defined. Simple verifications show that for all indices j in {1, ... ,p}, we have Y<g3j x g ~ = T / + 
T 7 = EgsjXg - £,p and therefore L ge ^' g = Lgeg^ g = w - The pair (x, (£ > ' 8 ) ge g) is therefore also 
optimal. In addition, for all groups g in Q and index / in { 1 , . . . , p}, it is easy to show that x g — J^f > 
and that we have at optimality sign(^) = sign(wy) = 1 for any nonzero J£. Therefore, the condition 
||^' g ||oo < x g is satisfied, which is stronger than the original constraint Nx > Y.^eg^' 8 - Moreover, it 
is easy to show that ||^||<» is necessary equal to x g at optimality (otherwise, one could decrease the 
value of x g to decrease the value of the objective function). We can now rewrite \\fg(yv) as 

V£(w) = <! min Yry^Hc s.t. £ ^ g = w, and Vg G g, Supp(^) C g 

{&€R") g eg g tg gtg 

and we have shown that \|/'^ = \|/^ on By noticing that in Equation (6) a solution (£, g ) ge g nec- 
essary satisfies sign(^) = sign(w ; ) for every group g and index j such that ^ ^ 0, we can extend 
the proof from R^ to W\ ■ 



Appendix C. Interpretation of the Weights r\ g with Coding Lengths 

Huang et al. (2011) have given an interpretation of the penalty cp^- defined in Equation (4) in terms 
of coding length. We use similar arguments to interpret the path-coding penalty cp^ ;; from an 
information-theoretic point of view. For appropriate weights r\ g , the quantity cp^w) for a vec- 
tor w in R p can be seen as a coding length for the sparsity pattern of w — that is, the following 
Kraft-MacMillan inequality (see Cover and Thomas, 2006; MacKay, 2003) is satisfied: 

£ 2"<M S ) < 1. (18) 
Se{o,i}p 

It is indeed well known in the information theory literature that there exists a binary uniquely de- 
codeable code over {0, 1} P with code length <Pg p (S) for every pattern S in {0, 1} P if and only if the 
above inequality is satisfied (see Cover and Thomas, 2006). We now show that a particular choice 
for the weights r\ g leads to an interesting interpretation. 

Let us consider the graph G' with source and sink vertices s and t defined in Section 3. We 
assume that a probability matrix transition n(u,v) for all (u,v) in E' is given. With such a matrix 
transition, it is easy to obtain a coding length for the set of paths Q p : 

Lemma 3 (Coding Length for Paths.) 

Let clgfor a path g = (vi, . . . ,vj) in Q p be defined as 




-\og 2 %(v k ,t). 



Then cl g is a coding length on Q p . 
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Proof. We observe that for every path (vi , . . . , Vk) in Q p corresponds a unique walk of length |V'| of 
the form (s,v\, .. . ,Vb,t,t, ■ ■ ■ ,t), and vice versa. Denoting by n'(s,t) the probability that a Markov 
chain associated to the probability transition matrix 7C stalling at the vertex u is at the vertex v at 
time t, it is easy to show that 

£2-^=^1(5,0 = 1, 

and therefore c\ g is a coding length on Q p . ■ 

the term — log 2 7t(5,vi) represents the number of bits used to indicate that a path g starts with the 
vertex vi, whereas the bits corresponding to the terms — log 2 7t(v;,v;_|_i) indicate that the vertex 
following Vi is Vj+i. The bits corresponding to last term — log 2 7t(v/t,f) indicate the end of the path. 
To define the weights r\ g , we now define the following costs: 

A J 1 — log 2 7t(w,v) if u = s 
uv \ — log 2 7t(«,v) otherwise. 

The weight therefore satisfies r\ g = L(h.v)g£' c «v = cl g + 1, and as shown by Huang et al. (201 1), 
this is a sufficient condition for (pg_(w) to be a coding length for {0, 1} P . 

We have therefore shown that (i) the different terms composing the weights r\ g can be interpreted 
as the number of bits used to encode the paths in the graph; (ii) it is possible to use probability 
transition matrices (or random walks) on the graph to design the weights r\ g . 

Appendix D. Proofs of the Propositions 
D.l Proofs of Propositions 1 and 2 

Proof. We start by proving Proposition 1. Let us consider the alternative definition of cp^ ; given 
in Equation (5). This is an optimization problem over all paths in G, or equivalently all (s,?)-pafhs 
in G' (since these two sets are in bijection). We associate to a vector x in {0, 1} P a flow / on G ', 
obtained by sending one unit of flow on every (s,t)-path g satisfying x 8 = 1 (x g denotes the entry 
of x associated to the group/path g). Each of these (s,?)-pafh flow has a cost r\ g and the total cost 
of / is exactly rj T x. 

We also observe that within this network flow framework, the constraint Nx > Supp(w) in 
Equation (5) is equivalent to saying that for all j in {1,. . . ,p} the amount of flow going through the 
vertex j (denoted by Sj(f)) is greater than one if Wj ^ 0. We have therefore shown that (p^ (w) is 
the minimum cost achievable by a flow / such that the constraint sj(f) > 1 is satisfied for all j in 
Supp(w) and such that / can be decomposed into binary (s,?)-pafh flows. 

To conclude the proof of Proposition 1, we show that there exists an optimal flow that admits 
a decomposition into binary (s,?)-path flows. We notice that all arc capacities in Equation (9) are 
integers. A classical result (Ahuja et al., 1993, Theorem 9.10) says that there exists an optimal inte- 
ger minimum-cost flow (a flow whose values on arcs are integers). We denote by / such a solution. 
Then, the flow decomposition theorem (Bertsekas, 1998, Proposition 1.1) tells us that / can be de- 
composed into (s,?)-path flows, but it also says that if / is integer, then / can be decomposed into 
integer (s,t)-path flows. We conclude the proof by noticing that sending more than one unit of flow 
on a path is not optimal (one can reduce the cost by sending only one unit of flow, while keeping 
the capacity constraints satisfied), and therefore there exists in fact a decomposition of / into binary 
(s,?)-pafh flows. The quantity presented in Equation (9) is therefore equal to cp^ fw). 
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The proof of Proposition 2 builds upon the definition of \\fg given in Equation (6) and is similar 
to the one of Proposition 1 . ■ 



D.2 Proof of Proposition 3 

Proof. Using the definition of the proximal operator in Equation (3) and the definition of cp^ in 
Equation (5), there exists a pattern S in {0, 1} P such that the solution w* of the proximal problem 
satisfies for all j, w*j = u 7 if j is in S, and = otherwise. We therefore rewrite Equation (3) by 
using the result of Proposition 1 




Ss{0,l} p ,/eJ i -> i- 

When S is fixed, the above expression is a minimum cost flow problem with integer capacity con- 
straints. Thus, there exists an integer flow solution, and we can, without loss of generality, con- 
strain / to be integer, and replace the constraints sj(f) > 1 by Sj(f) > 0. After this modifica- 
tion, for / is fixed, minimizing with respect to 5 gives us the following closed form: for all j 
in {1, . . . ,p}, Sj = 1 if Sj(f) > and otherwise. With this choice for S, we have in addition 
Ly^s u y = EjLi max ( u y(l ~~ s j(.f))ity > an d denoting by ^j nt the set of integer flows, we can equiv- 
alently rewrite the optimization problem 

mini £ / MV c KV + £^max(u^(l-jy(/)),0) I. 

/G/int {{u,v)€E' j=\ Z J 

It is easy to transform this minimum cost flow problem with piecewise linear costs to a classical 
minimum cost flow problem (see Bertsekas, 1998, Exercise 1.19) with integral constraints. There- 
fore, it is possible to remove the constraint / G j7j nt and replace it by / € J- without changing the 
optimal value of the cost function, leading to the formulation proposed in Equation (11). ■ 



D.3 Proof of Proposition 4 

Proof. Without loss of generality, let us suppose that u is in MF + . Let us indeed denote by w* = 
Prox,^ [u] . It is indeed easy to see that the signs of the entries of w* are necessary the same as those 
of u, and flipping the signs of some entries of u results in flipping the signs of the corresponding 
entries in w*. According to Proposition 2, we can write the proximal problem as 



L ^ 7=1 (u,v)€E> 



min |oL( u J- ff /) + L /»vC«v S.t. Sj(f) >Wy,V/e{l,...,p} } . 

When / is fixed, minimizing with respect to w yields for all j, = min (a.j,Sj (/*)). Plugging this 
closed form in the above equation yields the desired formulation. ■ 
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D.4 Proof of Proposition 5 

Proof. We recall that according to Lemma 1 we have for all w in IR+ 



Y&,(w) = min I r| T x s.t. Nx > w > . 



This is a linear program, whose dual (see Nocedal and Wright, 2006) gives us another definition 
for \\fg on M^. Since strong duality holds here, we have 

¥<5 ( w ) = max \ wTk s -t- N t k < r| > . 

It is easy to show that one can extend this definition on W } such that we have 

\|/ c (w) = max I w T K s.t. max ^ Kg ^ < 1 i . (19) 

where K^, denotes the vector of size \g | containing the entries of K corresponding to the indices in the 
group g. Note that a similar formula appears in (Jacob et al., 2009, Lemma 2), when the ^-norm is 
used in place of the l^. We now define for a vector K in W , 



\if r (k) = max 6 



It is easy to see that it is a norm, and by Equation (19), this is in fact the dual norm of the norm \\fg p . 
We can now rewrite it as 



\\f*r (k) = min |x s.t. max ^ Kg ^ <xl 
Y ^' A xeR + \ g eg„ Tjg - J 



K J i 

min a s.t. max — ti„<0 

1 geg„ X s 



= min < x s.t. min / T (#) > 

ieiR + [ geq p 

where we have identified the groups in Q p to their corresponding (s,f)-paths in G' . 



D.5 Proof of Proposition 6 
Proof. 

Correctness: 

We start by showing that when the algorithm converges, it returns the correct solution. We remark 
that the choice of x in the algorithm ensures that there always exists a group h in Q p such that / T (h) = 
and therefore we always have 8 < 0. Thus, when the algorithm converges, 8 is equal to zero. 
Moreover, the function G : x — > min/, e ^ l x (h) is non-increasing with x since the functions x — > l x (h) 
are themselves non-increasing for all h in Q p . It is also easy to show that there exists a unique x 
such that G(x) = 0, which is the desired solution. We conclude by noticing that at convergence, we 
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have G(x) = 8 = 0. 
Convergence and complexity: 

We now show that the algorithm converges and give a worst-case complexity. We denote by x&, gu 
and 8* the respective values of %,g and 8 at the iteration k of the algorithm. The definition of x& + i 
implies that 

h k+l {gk)=Q = h k {gk) + \\K gk \\\ 



Ik 1k+l 

§*<0 * — 



Moreover, 



-5 k >0 

h+l = h k+l (gk+l) = h k (gk+l) + \\Kg k+1 \\l(— ■ 



^k 

Since l Xk (gk+i) > $k (8/t is the length of the shortest path), we can show that 



5*+i >8* 1 



11% Hi 

Since 6^ + i < 0, we remark that necessarily \\K gM \\\ < ||K gi ||i, and we have two possibilities 

1. either ||k^ +1 ||i = ||K gil ||i and 8& + i = 0, meaning that the algorithm has converged. 

2. either ||K gt+1 1| i < \\K gk || i and it is easy to show that is implies that r\ gk+l < T\ gk . 

Since r)/, = y+ \h\, we obtain that r\ gk is strictly decreasing with k before the convergence of the 
algorithm. Since it can have at most p different values, the algorithm converges in at most p itera- 
tions. Updating the path g in the algorithm can be done by solving a shortest path problem in the 
graph G' , which can be done in 0(|£|) operations since the graph is acyclic (Ahuja et al., 1993), 
and the total worst-case complexity is 0(p\E\), which concludes the proof. ■ 



D.6 Proof of Proposition 7 

Proof. We denote by K the quantity K = VL(w), and respectively by K and w the vectors recording 
the entries of K and w that are in V. 
Convergence of the algorithm: 

Convergence of the algorithm is easy to show and consists of observing that G is strictly increasing. 
After solving subproblem ( 1 4), we have from the optimality conditions of Lemma 2 that \|/^ (ic) < 

VP 

By definition of the dual-norm given in Proposition 5, and using the same notation, we have that 
for all g in Q p , l\{g) > 0. We now denote by x the quantity x = \|/^ (k); if x < X, the algorithm 

stops. If not, we have that for all g in Q p , l x (g) > (since x > A. and lx(g) > for all g in Qp). The 
step g <— aigmin ge g p l z (g) then selects a group g such that l z (g) = (which is easy to check given 
the definition of \|/^ in Proposition 5. Therefore, the selected path g is not in G, and the size of G 
strictly increases, which guarantees the convergence of the algorithm. 
Correctness: 

We want to show that when the algorithm stops, it returns the correct solution. First, if we have G = 
G, it is trivially correct. If it stops with G ^ G, we have that \|/^ (k) < X, and according to 
Lemma 2, we only need to check that — K T w = X\|/^ /; (w). We remark that we have X\|/^(w) < 
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X\\fg (w) = — K T w = — K T w < \|/^(k)\|/^(w), where the first inequality is easy to show when ob- 
serving that Q p C Q p , and the last inequality is the generalized Holder inequality for a norm and its 
dual-norm. Since \|/^ (k)\|/^ p (w) < fa\fg p (w) we have in fact equality, and we conclude the proof. ■ 

References 

N. Ahmed, T. Natarajan, and K.R. Rao. Discrete cosine transform. IEEE Transactions on Comput- 
ers, C-23(l):90-93, 1974. 

R.K. Ahuja, T.L. Magnanti, and J.B. Orlin. Network Flows. Prentice Hall, 1993. 

H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second 
International Symposium on Information Theory, volume 1, pages 267-281, 1973. 

F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in 
Neural Information Processing Systems (NIPS), 2008. 

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. 
Foundation and Trends in Machine Learning, 4:1-106, 2012. 

A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and model- 
ing. IEEE Transactions on Information Theory, 44(6):2743-2760, 1998. 

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse prob- 
lems. SI AM Journal on Imaging Sciences, 2(1): 183-202, 2009. 

E. Bernard, L. Jacob, J. Mairal, and J.-R Vert. Efficient RNA isoform identification and quantifica- 
tion from RNA-Seq data with network flows, preprint hal-00803134, 2013. 

D.R Bertsekas. Network Optimization: Continuous and Discrete Models. Athena Scientific, 1998. 

T. Blumensath and M.E. Davies. Iterative hard thresholding for compressed sensing. Applied and 
Computational Harmonic Analysis, 27(3):265-274, 2009. 

M. Boguna, R. Pastor-Satorras, A. Diaz-Guilera, and A. Arenas. Models of social networks based 
on social distance attachment. Physical Review E, 70(5):056122, 2004. 

S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. 

Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE 
Transactions on Pattern Analysis and Machine Intelligence, 23(11): 1222-1239, 2001. 

A. Buades, B. Coll, and J.M. Morel. A review of image denoising algorithms, with a new one. 
SIAM Multiscale Modelling and Simulation, 4(2):490, 2005. 

V. Cehver, M. Duarte, C. Hedge, and R. G. Baraniuk. Sparse signal recovery using Markov random 
fields. In Advances in Neural Information Processing Systems (NIPS), 2008. 



34 



Feature Selection in Graphs with Path Coding Penalties 



A. Chambolle and J. Darbon. On total variation minimization and surface evolution using parametric 
maximal flows. International Journal of Computer Vision, 84(3):288-307, 2009. 

S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM 
Journal on Scientific Computing, 20(1):33-61, 1999. 

X. Chen, Q. Lin, S. Kim, J. Pena, J.G. Carbonell, and E.P. Xing. Smoothing proximal gradient 
method for general structured sparse learning. In Proceedings of the Twenty-Seven Conference 
on Uncertainty in Artificial Intelligence (UAI), 2011. 

H. Y. Chuang, E. Lee, Y.T. Liu, D. Lee, and T. Ideker. Network-based classification of breast cancer 
metastasis. Molecular Systems Biology, 3(140), 2007. 

T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to algorithms. MIT Press, 
2001. 

C. Couprie, L. Grady, H. Talbot, and L. Najman. Combinatorial continuous maximum flow. SIAM 
Journal on Imaging Sciences, 4:905-930, 2011. 

T.M. Cover and J. A Thomas. Elements of information theory. Wiley, 2006. 2nd edition. 

K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3D transform- 
domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080-2095, 2007. 

I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse 
problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11): 
1413-1457, 2004. 

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned 
dictionaries. IEEE Transactions on Image Processing, 54(12):3736-3745, 2006. 

J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. 
Journal of the American Statistical Association, 96(456):1348-1360, 2001. 

L.R. Ford and D.R. Fulkerson. Maximal flow through a network. Canadian Journal of Mathematics, 
8(3):399-404, 1956. 

P. Gleiser and L. Danon. Community structure in jazz. Advances in Complex Systems, 6(4):565- 
573, 2003. 

A.V. Goldberg. An Efficient Implementation of a Scaling Minimum-Cost Flow Algorithm. Journal 
of Algorithms, 22(1): 1-29, 1997. 

A.V. Goldberg and R.E. Tarjan. A new approach to the maximum flow problem. In Proceedings of 
the ACM Symposium on Theory of Computing, 1986. 

R. Guimera, L. Danon, A. Diaz Guilera, F. Giralt, and A. Arenas. Self-similar community structure 
in a network of human interactions. Physical Review E, 68(6):065103, 2003. 

D. S. Hochbaum. Complexity and algorithms for nonlinear optimization problems. Annals of Oper- 
ations Research, 153(l):257-296, 2007. 



35 



Mairal and Yu 



H. Hoefling. A path algorithm for the fused lasso signal approximator. Journal of Computational 
and Graphical Statistics, 19(4):984-1006, 2010. 

J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. Journal of Machine Learn- 
ing Research, 12:3371-3412, 2011. 

D.R. Hunter and K. Lange. A tutorial on MM algorithms. The American Statistician, 58(l):30-37, 
2004. 

L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlap and graph Lasso. In Proceedings 
of the International Conference on Machine Learning (ICML), 2009. 

R. Jenatton, J-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. 
Journal of Machine Learning Research, 12:2777-2824, 2011. 

D.J.C. MacKay. Information theory, inference, and learning algorithms. Cambridge University 
Press, 2003. 

J. Mairal. Sparse coding for machine learning, image processing and com- 

puter vision. PhD thesis, Ecole Normale Superieure de Cachan, 2010. 

http : / /tel . archives -ouvertes .fr/tel-00595312. 

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image 
restoration. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 
2009. 

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse 
coding. Journal of Machine Learning Research, 11:19-60, 2010. 

J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for 
structured sparsity. Journal of Machine Learning Research, 12:2649-2689, 2011. 

S. Mallat and Z. Zhang. Matching pursuit in a time-frequency dictionary. IEEE Transactions on 
Signal Processing, 41(12):3397-3415, 1993. 

N. Meinshausen and P. Biihlmann. Stability selection. Journal of the Royal Statistical Society: 
Series B (Statistical Methodology), 72(4):417-473, 2010. 

A. Nemirovsky and D. Yudin. Problem complexity and Method Efficiency in Optimization. Wiley, 
1983. 

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, 
CORE Discussion paper, 2007. 

J. Nocedal and S.J. Wright. Numerical optimization. Springer Verlag, 2006. 2nd edition. 

G. Obozinski and F. Bach. Convex relaxations for combinatorial penalties. Technical report, 2012. 
preprint arXiv:1205.1240vl. 

M. R. Osborne, B. Presnell, and B. A. Turlach. On the Lasso and its dual. Journal of Computational 
and Graphical Statistics, 9(2):3 19-37, 2000. 



36 



Feature Selection in Graphs with Path Coding Penalties 



J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli. Image denoising using scale mixtures 
of Gaussians in the wavelet domain. IEEE Transactions on Image Processing, 12(1 1): 1338-1351, 
2003. 

F. Rapaport, A. Zinovyev, M. Dutreix, E. Barillot, and J.-P. Vert. Classification of microarray data 
using gene networks. BMC Bioinformatics, 8(1):35, 2007. 

J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465^171, 1978. 

M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for 
convex optimization. In Advances in Neural Information Processing Systems (NIPS), 2011. 

G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461^1-64, 1978. 

R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical 
Society: Series B, 58(l):267-288, 1996. 

B.A. Turlach, W.N. Venables, and S.J. Wright. Simultaneous variable selection. Technometrics, 47 
(3):349-363, 2005. 

M.H. Van De Vijver et al. A gene-expression signature as a predictor of survival in breast cancer. 
The New England Journal of Medicine, 347(25): 1999-2009, 2002. 

S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE 
Transactions on Signal Processing, 57(7):2479-2493, 2009. 

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal 
of the Royal Statistical Society: Series B, 68:49-67, 2006. 

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical 
variable selection. Annals of Statistics, 37(6A):3468-3497, 2009. 

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal 
Statistical Society: Series B, 67(2):301-320, 2005. 



37 



